Sampling Correctors
In many situations, sample data is obtained from a noisy or imperfect source.
In order to address such corruptions, this paper introduces the concept of a
sampling corrector. Such algorithms use structure that the distribution is
purported to have, in order to allow one to make "on-the-fly" corrections to
samples drawn from probability distributions. These algorithms then act as
filters between the noisy data and the end user.
We show connections between sampling correctors, distribution learning
algorithms, and distribution property testing algorithms. We show that these
connections can be utilized to expand the applicability of known distribution
learning and property testing algorithms as well as to achieve improved
algorithms for those tasks.
As a first step, we show how to design sampling correctors using proper
learning algorithms. We then focus on the question of whether algorithms for
sampling correctors can be more efficient in terms of sample complexity than
learning algorithms for the analogous families of distributions. When
correcting monotonicity, we show that this is indeed the case when also granted
query access to the cumulative distribution function. We also obtain sampling
correctors for monotonicity without this stronger type of access, provided that
the distribution be originally very close to monotone (namely, at a distance
). In addition to that, we consider a restricted error model
that aims at capturing "missing data" corruptions. In this model, we show that
distributions that are close to monotone have sampling correctors that are
significantly more efficient than achievable by the learning approach.
We also consider the question of whether an additional source of independent
random bits is required by sampling correctors to implement the correction
process.
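The learning-based construction mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's algorithm: it assumes a finite domain {0, ..., n-1}, takes "monotone" to mean non-increasing, and swaps in pool-adjacent-violators (isotonic regression on the empirical histogram) as a stand-in for a proper learner. The names `project_non_increasing` and `make_corrector` are hypothetical.

```python
import random
from collections import Counter


def project_non_increasing(p):
    """L2-project a probability vector onto non-increasing sequences
    using pool-adjacent-violators. Total mass is preserved."""
    blocks = []  # each block is [sum of entries, number of entries]
    for x in p:
        blocks.append([x, 1])
        # Merge while an earlier block has a smaller mean than the next one.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] < blocks[-1][0] / blocks[-1][1]
        ):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    out = []
    for s, c in blocks:
        out.extend([s / c] * c)
    return out


def make_corrector(noisy_sampler, n, num_samples, rng):
    """Learn a monotone hypothesis from noisy samples, then sample from it.

    noisy_sampler: hypothetical zero-argument callable returning a value
    in range(n). Returns the hypothesis and a sampler for it.
    """
    counts = Counter(noisy_sampler() for _ in range(num_samples))
    empirical = [counts[i] / num_samples for i in range(n)]
    hypothesis = project_non_increasing(empirical)

    def corrected_sample():
        # Assumption: the corrector has its own source of random bits (rng).
        return rng.choices(range(n), weights=hypothesis, k=1)[0]

    return hypothesis, corrected_sample
```

The corrector here pays the full sample cost of learning up front; the abstract's question is precisely whether one can do better than this baseline.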
Deterministic Distributed Algorithms and Lower Bounds in the Hybrid Model
The HYBRID model was recently introduced by Augustine et al. [John Augustine et al., 2020] in order to characterize from an algorithmic standpoint the capabilities of networks which combine multiple communication modes. Concretely, it is assumed that the standard LOCAL model of distributed computing is enhanced with the feature of all-to-all communication, but with very limited bandwidth, captured by the node-capacitated clique (NCC). In this work we provide several new insights on the power of hybrid networks for fundamental problems in distributed algorithms.
First, we present a deterministic algorithm which solves any problem on a sparse n-node graph in Õ(√n) rounds of HYBRID, where the notation Õ(·) suppresses polylogarithmic factors of n. We combine this primitive with several sparsification techniques to obtain efficient distributed algorithms for general graphs. Most notably, for the all-pairs shortest paths problem we give deterministic (1 + ε)- and log n/log log n-approximate algorithms for unweighted and weighted graphs respectively with round complexity Õ(√n) in HYBRID, closely matching the performance of the state of the art randomized algorithm of Kuhn and Schneider [Kuhn and Schneider, 2020]. Moreover, we make a connection with the Ghaffari-Haeupler framework of low-congestion shortcuts [Mohsen Ghaffari and Bernhard Haeupler, 2016], leading, among others, to a (1 + ε)-approximate algorithm for Min-Cut after O(polylog(n)) rounds, with high probability, even if we restrict local edges to transfer O(log n) bits per round. Finally, we prove via a reduction from the set disjointness problem that Ω̃(n^{1/3}) rounds are required to determine the radius of an unweighted graph, as well as a (3/2 - ε)-approximation for weighted graphs. As a byproduct, we show an Ω̃(n) round-complexity lower bound for computing a (4/3 - ε)-approximation of the radius in the broadcast variant of the congested clique, even for unweighted graphs.
Efficient Statistics, in High Dimensions, from Truncated Samples
We provide an efficient algorithm for the classical problem, going back to
Galton, Pearson, and Fisher, of estimating, with arbitrary accuracy, the
parameters of a multivariate normal distribution from truncated samples.
Truncated samples from a multivariate normal means that a sample is only
revealed if it falls in some subset of the domain, the truncation set;
otherwise the samples are hidden and their count in proportion to the
revealed samples is also hidden. We show that the mean and covariance matrix
can be estimated with arbitrary accuracy in polynomial time, as long as we
have oracle access to the truncation set, and this set has non-trivial measure
under the unknown normal distribution. Additionally we show that without
oracle access to the truncation set, any non-trivial estimation is impossible.
Comment: to appear at 59th Annual IEEE Symposium on Foundations of Computer
Science (FOCS), 201
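To make the truncation model concrete, here is a small sketch that simulates truncated samples from a two-dimensional standard normal. The choice of truncation set (the positive quadrant), the dimension, and all function names are illustrative assumptions. The sketch only simulates the data model; it does not implement the paper's estimator.

```python
import random


def in_truncation_set(x):
    """Membership oracle for a hypothetical truncation set: the positive quadrant."""
    return all(coord > 0.0 for coord in x)


def truncated_samples(k, dim, oracle, rng):
    """Draw standard normal vectors and reveal only those the oracle accepts."""
    revealed = []
    while len(revealed) < k:
        x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        if oracle(x):
            revealed.append(x)
        # Rejected draws are hidden: they are discarded and never counted.
    return revealed


def empirical_mean(samples):
    """Coordinate-wise average of the revealed samples."""
    dim = len(samples[0])
    return [sum(s[i] for s in samples) / len(samples) for i in range(dim)]


rng = random.Random(0)
samples = truncated_samples(5000, 2, in_truncation_set, rng)
naive_mean = empirical_mean(samples)
```

The naive empirical mean of the revealed samples is biased away from the true mean (the zero vector in this simulation), which is why truncated data calls for a dedicated estimator.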
Secretary and Online Matching Problems with Machine Learned Advice
The classical analysis of online algorithms, due to its worst-case nature,
can be quite pessimistic when the input instance at hand is far from
worst-case. Often this is not an issue with machine learning approaches, which
shine in exploiting patterns in past inputs in order to predict the future.
However, such predictions, although usually accurate, can be arbitrarily poor.
Inspired by a recent line of work, we augment three well-known online settings
with machine learned predictions about the future, and develop algorithms that
take them into account. In particular, we study the following online selection
problems: (i) the classical secretary problem, (ii) online bipartite matching
and (iii) the graphic matroid secretary problem. Our algorithms still come with
a worst-case performance guarantee in the case that predictions are subpar
while obtaining an improved competitive ratio (over the best-known classical
online algorithm for each problem) when the predictions are sufficiently
accurate. For each algorithm, we establish a trade-off between the competitive
ratios obtained in the two respective cases.
Learning Augmented Online Facility Location
Following the research agenda initiated by Munoz & Vassilvitskii [1] and
Lykouris & Vassilvitskii [2] on learning-augmented online algorithms for
classical online optimization problems, in this work, we consider the Online
Facility Location problem under this framework. In Online Facility Location
(OFL), demands arrive one-by-one in a metric space and must be (irrevocably)
assigned to an open facility upon arrival, without any knowledge about future
demands.
We present an online algorithm for OFL that exploits potentially imperfect
predictions on the locations of the optimal facilities. We prove that the
competitive ratio decreases smoothly from sublogarithmic in the number of
demands to constant, as the error, i.e., the total distance of the predicted
locations to the optimal facility locations, decreases towards zero. We
complement our analysis with a matching lower bound establishing that the
dependence of the algorithm's competitive ratio on the error is optimal, up to
constant factors. Finally, we evaluate our algorithm on real world data and
compare our learning augmented approach with the current best online algorithm
for the problem.
Sample-Optimal Identity Testing with High Probability
We study the problem of testing identity against a given distribution with a focus on the high confidence regime. More precisely, given samples from an unknown distribution p over n elements, an explicitly given distribution q, and parameters 0< epsilon, delta < 1, we wish to distinguish, with probability at least 1-delta, whether the distributions are identical versus epsilon-far in total variation distance. Most prior work focused on the case that delta = Omega(1), for which the sample complexity of identity testing is known to be Theta(sqrt{n}/epsilon^2). Given such an algorithm, one can achieve arbitrarily small values of delta via black-box amplification, which multiplies the required number of samples by Theta(log(1/delta)).
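Black-box amplification can be sketched as a majority vote over independent runs of a constant-confidence tester. In the sketch below, `base_tester` is a hypothetical zero-argument callable that draws fresh samples on each call, returns `True` (accept) or `False` (reject), and is assumed correct with probability at least 2/3. The constant 18 is an illustrative choice: with that assumption, Hoeffding's inequality bounds the failure probability of the majority vote by exp(-k/18) after k runs.

```python
import math


def num_runs(delta):
    """Odd number of repetitions, Theta(log(1/delta)), for target failure probability delta."""
    k = math.ceil(18 * math.log(1.0 / delta))
    return k if k % 2 == 1 else k + 1


def amplify(base_tester, delta):
    """Run base_tester independently and return the majority answer."""
    runs = num_runs(delta)
    accepts = sum(1 for _ in range(runs) if base_tester())
    return accepts > runs // 2
```

Because every run consumes its own batch of samples, the total sample count is the base tester's cost multiplied by `num_runs(delta)`.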
We show that black-box amplification is suboptimal for any delta = o(1), and give a new identity tester that achieves the optimal sample complexity. Our new upper and lower bounds show that the optimal sample complexity of identity testing is Theta((1/epsilon^2) (sqrt{n log(1/delta)} + log(1/delta))) for any n, epsilon, and delta. For the special case of uniformity testing, where the given distribution is the uniform distribution U_n over the domain, our new tester is surprisingly simple: to test whether p = U_n versus d_{TV}(p, U_n) >= epsilon, we simply threshold d_{TV}(\hat{p}, U_n), where \hat{p} is the empirical probability distribution. The fact that this simple "plug-in" estimator is sample-optimal is surprising, even in the constant delta case. Indeed, it was believed that such a tester would not attain sublinear sample complexity even for constant values of epsilon and delta.
An important contribution of this work lies in the analysis techniques that we introduce in this context. First, we exploit an underlying strong convexity property to bound from below the expectation gap in the completeness and soundness cases. Second, we give a new, fast method for obtaining provably correct empirical estimates of the true worst-case failure probability for a broad class of uniformity testing statistics over all possible input distributions - including all previously studied statistics for this problem. We believe that our novel analysis techniques will be useful for other distribution testing problems as well.
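The plug-in uniformity tester described in this abstract can be sketched in a few lines. The threshold is left as a caller-supplied parameter because the abstract does not state its value; `tv_to_uniform` and `plug_in_uniformity_test` are hypothetical names, and the domain is assumed to be {0, ..., n-1}.

```python
from collections import Counter


def tv_to_uniform(samples, n):
    """Total variation distance between the empirical distribution and U_n."""
    m = len(samples)
    counts = Counter(samples)
    return 0.5 * sum(abs(counts.get(i, 0) / m - 1.0 / n) for i in range(n))


def plug_in_uniformity_test(samples, n, threshold):
    """Accept (return True) when the plug-in statistic is at most the threshold."""
    return tv_to_uniform(samples, n) <= threshold
```

With far fewer samples than domain elements, the statistic is close to 1 even for uniform input, so the threshold has to be calibrated against the statistic's expected value under U_n rather than set near zero.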